4. Reference Distributions#

In this notebook we compute the reference distributions for each catchment in the sample, representing the baseline period of record (POR) flow duration curve (FDC) used for validation. The baseline is estimated by non-parametric kernel density estimation (KDE), a flexible and powerful way of estimating the probability density function (PDF) of a random variable. The KDE is computed from POR observations at all monitoring stations with \(\geq\) five complete years of record, and the resulting distribution serves as the benchmark against which the ensemble-method estimates are compared.

4.1. Kernel Density Estimation#

The “ground truth” FDC is represented by KDE using POR observations from all monitoring stations with \(\geq\) five complete years of record. Streamflow distributions span several orders of magnitude, the (unique) values making up a distribution can be very sparse in parts of that range, and distributions can have different numbers of modes. These characteristics mean that even representations based on several years of observation are only approximations, and even flexible methods like KDE have difficulty representing some records. As a result, we use an adaptive bandwidth based on an assumed “overall” flow measurement uncertainty model. The model is plotted below: it assigns very high uncertainty to the lowest flows that can reasonably be measured, and elevated uncertainty at high flows, reflecting typical uncertainty encountered in hydrological field measurement. Note that it is not an empirically derived error model; rather, it is a conservative approximation of error used to widen the (Gaussian) kernel bandwidth where observations are typically sparse (at the extremes), which coincides with higher uncertainty.
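The assumed error model is piecewise linear in log flow. A minimal sketch of how such a model can be evaluated at arbitrary flows (the `relative_error` helper is illustrative; the control points mirror those plotted in the code cell below and are assumptions, not empirical values):

```python
import numpy as np

# assumed control points of the "overall" error model (not empirical)
flow_pts = np.array([1e-4, 1e-3, 1e-2, 1e-1, 1.0, 1e1, 1e2, 1e3, 1e4, 1e5])  # m^3/s
rel_err = np.array([1.0, 0.5, 0.2, 0.1, 0.1, 0.1, 0.1, 0.15, 0.2, 0.25])

def relative_error(q):
    """Relative measurement error at flow q [m^3/s], piecewise linear in log-flow."""
    return np.interp(np.log10(q), np.log10(flow_pts), rel_err)
```

Interpolating in log flow matches the log-scaled x-axis of the plot, so the error varies smoothly across the orders of magnitude a streamflow record spans.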

4.1.1. Adaptive Bandwidth Kernel#

A simulated probability density function is estimated from observations of the k nearest-neighbour stations. First, k simulated series are generated by equal unit-area runoff scaling, \(\hat p = \hat P \cdot w\), where \(\hat P = [\hat p_1, \hat p_2, \cdots, \hat p_k]\) and each \(\hat p_i = f(X_i(t))\).

In both cases, the function \(f \rightarrow \hat p(x)\) represents kernel density estimation, which defines the probability density as

\[\hat p(x) = \frac{1}{n} \sum_{i=1}^{n}\frac{1}{h(x)}K\left( \frac{x-x_i}{h(x)}\right)\]

where \(h(x)\) is an adaptive kernel bandwidth that smooths over vestiges of precision in the observed data, reflecting the nature of streamflow as a continuous variable, and additionally incorporates a piecewise linear model of overall measurement uncertainty.
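The equation above can be sketched directly in NumPy. This is a "balloon"-style estimator with a per-grid-point bandwidth \(h(x)\); it is an illustration of the formula, not the notebook's `KDEEstimator` implementation:

```python
import numpy as np

def adaptive_kde(x_grid, samples, h):
    """Evaluate a Gaussian KDE on x_grid with a per-grid-point bandwidth h(x)."""
    x_grid = np.asarray(x_grid, float)
    h = np.asarray(h, float)
    # (n_grid, n_samples) standardized distances between grid points and samples
    u = (x_grid[:, None] - np.asarray(samples, float)[None, :]) / h[:, None]
    K = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)  # Gaussian kernel
    # average kernel contributions, scaled by the local bandwidth
    return K.sum(axis=1) / (len(samples) * h)
```

With a constant `h` this reduces to the standard fixed-bandwidth KDE; with `h` widened at the extremes it implements the adaptive behaviour described above.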

4.1.2. Sparse support#

Through exploration of many example PDF approximations by KDE, a second problem became evident: sparse support. Despite requiring a minimum of 5 years of daily records, some observation sets have extremes that destabilize the resulting PDF if the KDE uses a “rule of thumb” (constant) bandwidth, i.e. the Silverman approximation. In some cases the bandwidth set by the error model is enough to smooth the distribution around sparse support (high precision relative to the density of observed values); however, we add a second bandwidth criterion based on the distance between unique values in the dataset. The final bandwidth is the greater of the one set by the assumed error and the one set by the precision.
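The two-criteria rule can be sketched as follows. The `combined_bandwidth` helper and the 0.5 factor on the median gap are illustrative assumptions, not the notebook's exact implementation:

```python
import numpy as np

def combined_bandwidth(error_bw, log_samples):
    """Final bandwidth: the greater of the error-model bandwidth and a
    precision floor derived from the spacing of unique observed (log) values."""
    uniq = np.unique(np.asarray(log_samples, float))
    gaps = np.diff(uniq)
    # precision floor: half the median gap between unique values (illustrative)
    precision_bw = 0.5 * np.median(gaps) if gaps.size else 0.0
    return np.maximum(np.asarray(error_bw, float), precision_bw)
```

Where the record is reported at coarse precision (large gaps between unique values), the precision floor dominates and smooths out the artificial spikes that a narrow kernel would otherwise produce.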

from multiprocessing import Pool
from functools import partial

import os
import pandas as pd
import json
import numpy as np
import geopandas as gpd
import xarray as xr

from scipy.stats import laplace

from pathlib import Path
from bokeh.plotting import figure, show, gridplot
from bokeh.io import output_notebook
output_notebook()


from utils.kde_estimator import KDEEstimator
import utils.data_processing_functions as dpf
BASE_DIR = os.getcwd()

# update this to the path where you stored `HYSETS_2023_update_QC_stations.nc`
HYSETS_DIR = Path('/home/danbot/code/common_data/HYSETS')
# plot the measurement error:
efig = figure(title="Estimated Measurement Error Model", width=600, height=400, x_axis_type='log')
error_points = np.array([1e-4, 1e-3, 1e-2, 1e-1, 1., 1e1, 1e2, 1e3, 1e4, 1e5])  #  Reference flow points in m^3/s
error_values = np.array([1.0, 0.5, 0.2, 0.1, 0.1, 0.1, 0.1, 0.15, 0.2, 0.25])
efig.line(error_points, error_values, line_color='red', line_width=2, legend_label='Measurement Error Model')
efig.xaxis.axis_label = r'$$\text{Flow } m^3/s$$'
efig.yaxis.axis_label = r'$$\text{Error } [\text{x}100\%]$$'
efig.legend.background_fill_alpha = 0.5
efig = dpf.format_fig_fonts(efig, font_size=14)
layout = gridplot([efig], ncols=2, width=500, height=350)
show(layout)

4.2. Data pre-processing overview#

Note that these steps are optional and the end results of these pre-processing steps are provided in the open data repository.

4.2.1. Get updated data sources and validate catchment attributes#

  1. Extract catchment attributes using updated catchment geometries where available (optional, updated catchment geometries are saved in data/BCUB_watershed_bounds_updated.geojson).

  2. Process climate indices for HYSETS catchments in the study region (optional, pre-processed attributes are contained in BCUB_watershed_attributes_updated.csv)

4.3. Import HYSETS catchment attributes#

# import the HYSETS attributes data
hysets_df = pd.read_csv(HYSETS_DIR / 'HYSETS_watershed_properties.txt', sep=';')
ws_id_dict = hysets_df.set_index('Official_ID')['Watershed_ID'].to_dict()
da_dict = hysets_df.set_index('Official_ID')['Drainage_Area_km2'].to_dict()
official_id_dict = ws_id_dict  # alias used below: Official_ID -> Watershed_ID

4.3.1. Import the pre-filtered stations within the study region#

# import the pre-filtered catchment attributes (with runoff statistics) for the BCUB study region
bcub_df = gpd.read_file(os.path.join('data', 'catchment_attributes_with_runoff_stats.csv'))
bcub_df['official_id'] = bcub_df['official_id'].astype(str)
# map the official station IDs to the HYSETS watershed IDs
bcub_df['watershedID'] = bcub_df['official_id'].apply(lambda x: official_id_dict.get(x, None))
print(f'   Found {len(bcub_df)} catchments in the BCUB region with runoff statistics.')
   Found 1098 catchments in the BCUB region with runoff statistics.

4.4. Import streamflow timeseries#

Note

At the top of data_processing_functions.py, update the STREAMFLOW_DIR variable to match where the HYSETS streamflow time series are stored.

# Load dataset
streamflow = xr.open_dataset(HYSETS_DIR / 'HYSETS_2023_update_QC_stations.nc')

# Promote 'watershedID' to a coordinate on 'watershed'
streamflow = streamflow.assign_coords(watershedID=("watershed", streamflow["watershedID"].data))

# Set 'watershedID' as index
streamflow = streamflow.set_index(watershed="watershedID")

# Select only watershedIDs present in bcub_df
valid_ids = [int(wid) for wid in bcub_df['watershedID'].values if wid in streamflow.watershed.values]
ds = streamflow.sel(watershed=valid_ids)

USGS station IDs are numeric, but they are stored in the dataset with the unfortunate characteristic that different stations can have identifiers that are substrings of one another. We add a few extra lines of code to ensure we retrieve the correct record.

# Confirm all watershed IDs exist in ds
bcub_ws_ids = bcub_df['watershedID'].values
ds_ids = ds.watershed.values  # After selection via sel(watershed=valid_ids)

missing = [wid for wid in bcub_ws_ids if wid not in ds_ids]
n_total = len(bcub_ws_ids)
n_missing = len(missing)

assert n_missing == 0, f"{n_missing} / {n_total} watershedIDs missing from dataset. First few missing: {missing[:5]}"

4.5. Calculate Period of Record (POR) probability distribution functions for each station record#

In order to compare distributions fairly and without data leakage in the subsequent prediction steps, we first define a “global evaluation grid”: the support over which all PDFs are evaluated. We do this by setting a global expected range of flow on a unit-area basis. Since we clipped the lowest flows to \(10^{-4}\) \(m^3/s\) at import, and we know the minimum drainage area in the dataset is 0.7 \(\text{km}^2\), we can find the minimum and maximum unit-area runoff in the dataset and leave some margin on either side.
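The unit conversions used throughout can be restated in a small helper (the function name is illustrative; the arithmetic matches the `retrieve_timeseries_discharge` code below):

```python
def to_unit_area_runoff(q_m3s, da_km2):
    """Convert discharge [m^3/s] to unit-area runoff [L/s/km^2] and [mm/day]."""
    uar = 1000.0 * q_m3s / da_km2    # 1000 L per m^3
    mm_day = q_m3s * 86.4 / da_km2   # 86400 s/day / (1e6 m^2/km^2) * 1000 mm/m
    return uar, mm_day
```

For example, the clipped minimum flow of \(10^{-4}\) \(m^3/s\) over the smallest catchment (0.7 km²) gives a UAR of about 0.14 L/s/km², which sits comfortably inside the \([10^{-6}, 10^4]\) grid bounds set below.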

Since we’ll be computing KL divergences, we will also assume a prior distribution based on the heavy-tailed Laplace distribution, using the mean and standard deviation of (log) unit-area runoff predicted (out of sample) from catchment attributes. This prior is applied to the donor/proxy PMF to avoid division by zero where the support becomes very small.
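Discretizing such a prior onto the evaluation grid can be sketched as below. This is an illustration of the idea, not the notebook's exact prior construction; `loc` and `scale` stand in for the predicted log-UAR statistics:

```python
import numpy as np
from scipy.stats import laplace

def laplace_prior_pmf(log_grid, log_dx, loc, scale):
    """Discretize a Laplace prior over the log-UAR evaluation grid into a PMF."""
    pdf = laplace.pdf(log_grid, loc=loc, scale=scale)
    pmf = pdf * log_dx       # density times local grid spacing gives mass
    return pmf / pmf.sum()   # renormalize so the PMF sums to exactly 1
```

Mixing a small weight of this prior into a donor PMF guarantees non-zero mass everywhere on the grid, keeping the KL divergence finite.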

# import the predicted mean and standard deviation from the previous notebook (Predict Runoff Statistics)
predicted_params = pd.read_csv('data/results/parameter_prediction_results/mean_parameter_predictions.csv', index_col=0)
param_dicts = {
    'mean': predicted_params['log_uar_mean_mean_predicted'].to_dict(),
    'sd': predicted_params['log_uar_std_mean_predicted'].to_dict(),
    'median': predicted_params['log_uar_median_mean_predicted'].to_dict(),
    'mad': predicted_params['log_uar_mad_mean_predicted'].to_dict(),
}

4.6. Define a global range of expected (unit area) runoff#

def retrieve_timeseries_discharge(stn):
    watershed_id = official_id_dict[stn]
    da = da_dict[stn]
    try:
        df = ds['discharge'].sel(watershed=str(watershed_id)).to_dataframe(name='discharge').reset_index()
    except KeyError:
        print(f"Warning: Station {stn} not found in dataset under watershedID {watershed_id}.")
        return pd.DataFrame()
    
    df = df.set_index('time')[['discharge']]
    df.dropna(inplace=True)
    df['zero_flow_flag'] = df['discharge'] == 0
    # df['discharge'] = np.clip(df['discharge'], 1e-4, None)
    # df.rename(columns={'discharge': stn}, inplace=True)
    df[f'{stn}_uar'] = 1000 * df['discharge'] / da
    df[f'{stn}_mm'] = df['discharge'] * (24 * 3.6 / da)
    df['replaced_zero_flow_uar'] = df['discharge'].clip(1e-4) * (1000 / da)
    df['log_uar'] = np.log(df['replaced_zero_flow_uar'])
    return df

def process_station(stn, da_dict):
    df = retrieve_timeseries_discharge(stn)
    if df.empty:
        return stn, None, None
    uar = df['replaced_zero_flow_uar'].dropna()
    return stn, uar.min(), uar.max()

def parallel_min_max(bcub_stations, da_dict, processes=None):
    processes = processes or 18
    with Pool(processes=processes) as pool:
        fn = partial(process_station, da_dict=da_dict)
        results = pool.map(fn, bcub_stations)

    global_min = float('inf')
    global_max = float('-inf')
    results = [res for res in results if res[1] is not None]
    for stn, min_uar, max_uar in results:
        if max_uar > global_max:
            global_max = max_uar
            print(f'Max streamflow for {stn}: {max_uar:.2f} L/s/km²')
        if min_uar < global_min:
            global_min = min_uar
            print(f'   Min streamflow for {stn}: {min_uar:.2e} L/s/km² (DA={da_dict[stn]:.2f} km²)')

    print(f"\n Global min = {global_min:.2e} L/s/km², max = {global_max:.2f} L/s/km²")


# Usage
compute_range = True
if compute_range:
    parallel_min_max(bcub_df['official_id'], da_dict)   
def set_uar_grid(global_min, global_max, n_grid_points=2**12):
    baseline_log_grid = np.linspace(
        global_min, global_max, n_grid_points
    )
    baseline_lin_grid = np.exp(baseline_log_grid)
    log_dx = np.gradient(baseline_log_grid)
    max_step_size = baseline_lin_grid[-1] - baseline_lin_grid[-2]
    print(f'    max step size = {max_step_size:.1f} L/s/km^2 for n={n_grid_points:.1e} grid points')
    min_step_size = baseline_lin_grid[1] - baseline_lin_grid[0]
    print(f'    min step size = {min_step_size:.2e} L/s/km^2 for n={n_grid_points:.1e} grid points')
    return baseline_lin_grid, baseline_log_grid, log_dx
# set global bounding values of log-UAR with margin beyond the observed range
# (the dataset maximum discharge is 19400; max_streamflow = ds['discharge'].max().values.item())
baseline_lin_grid, baseline_log_grid, log_dx = set_uar_grid(np.log(1e-6), np.log(1e4), n_grid_points=2**12)
base_kde_estimator = KDEEstimator(baseline_log_grid, log_dx)
    max step size = 56.1 L/s/km^2 for n=4.1e+03 grid points
    min step size = 5.64e-09 L/s/km^2 for n=4.1e+03 grid points
# set the minimum record length
min_record_years = 5
# choose if you want to set a global prior distribution based on the 
# Laplace fit to (out of sample) predicted location and scale
set_global_prior = False
validated_stations = bcub_df[bcub_df['n_years'].astype(float) >= min_record_years]['official_id'].values
print(f'Filtering stations with at least {min_record_years} complete years of data: {len(validated_stations)}/{len(bcub_df)} stations.')

# there's a problem with finding '0212414900' in the data, drop it
validated_stations = [stn for stn in validated_stations if stn != '0212414900']

# load the complete year dictionary
with open('data/complete_years.json', 'r') as f:
    complete_year_dict = json.load(f)
Filtering stations with at least 5 complete years of data: 1098/1098 stations.
from scipy.stats import wasserstein_distance

class ReferenceDistribution:
    def __init__(self, **kwargs):

        for k, v in kwargs.items():
            setattr(self, k, v)

    def _initialize_station(self, stn):
        self.stn = stn
        self.df = retrieve_timeseries_discharge(stn)
        self.da = self.da_dict[stn]
        self.n_observations = len(self.df.dropna())
def compute_baseline_distribution_by_kde(inputs):
    stn, input_config = inputs
    baseline_distribution = ReferenceDistribution(**input_config)
    baseline_distribution._initialize_station(stn)
    kde_obj = input_config['kde_obj']
    data = baseline_distribution.df['replaced_zero_flow_uar'].dropna().values
    da = input_config['da_dict'][stn]
    pmf, pdf = kde_obj.compute(data, da)
    return (stn, pmf, pdf)
pdf_path = Path(os.getcwd()) / 'data' / 'results' / 'baseline_distributions' / 'bcub_pdfs.csv'
pmf_path = Path(os.getcwd()) / 'data' / 'results' / 'baseline_distributions' / 'bcub_pmfs.csv'

shared_config = {
    'da_dict': da_dict,
    'complete_year_dict': complete_year_dict,
    'kde_obj': base_kde_estimator,
    }

if os.path.exists(pdf_path) and os.path.exists(pmf_path):
    pmf_df = pd.read_csv(pmf_path, index_col='q')
    pdf_df = pd.read_csv(pdf_path, index_col='q')
else:
    # compute the PDF and PMF for each station    
    inputs = [(stn, shared_config) for stn in validated_stations] 
    print(len(validated_stations), 'stations meeting minimum complete period of record.')
    
    # with Pool() as pool:
    #     results = pool.map(compute_baseline_distribution_by_kde, inputs)  
    results = []
    for ip in inputs:
        result = compute_baseline_distribution_by_kde(ip)      
        results.append(result)
        if len(results) % 100 == 0:
            print(f'Processed {len(results)}/{len(inputs)} stations...')
    # concatenate the results
    stations, pmfs, pdfs = zip(*results)
    pdf_df = pd.DataFrame(np.stack(pdfs, axis=1), columns=stations)
    pmf_df = pd.DataFrame(np.stack(pmfs, axis=1), columns=stations)

    pdf_df['q'] = baseline_lin_grid
    pmf_df['q'] = baseline_lin_grid

    # save the pdf and pmf files
    pdf_df.set_index('q', inplace=True)
    pmf_df.set_index('q', inplace=True)
    pdf_df.to_csv(pdf_path)
    pmf_df.to_csv(pmf_path)

4.6.1. Compare the Adaptive KDE PMFs vs. the FFTKDE PMF#

from utils.evaluation_metrics import EvaluationMetrics
from KDEpy import FFTKDE
eval_obj = EvaluationMetrics(baseline_log_grid, log_dx)
all_results = {}
# rebuild the inputs list here so this cell also works when the PDFs were loaded from file
inputs = [(stn, shared_config) for stn in validated_stations]
for stn, input_config in inputs:
    akde_pmf = pmf_df[stn].values
    # assert the PMF sums to 1
    assert np.isclose(np.sum(akde_pmf), 1, atol=1e-4), f'PMF for {stn} does not sum to 1: {np.sum(akde_pmf):.4f}'

    baseline_distribution = ReferenceDistribution(**input_config)
    baseline_distribution._initialize_station(stn)
    kde_obj = input_config['kde_obj']
    data = baseline_distribution.df['replaced_zero_flow_uar'].dropna().values
    try:
        fft_kde_pdf = FFTKDE(bw="silverman").fit(np.log(data)).evaluate(baseline_log_grid)
    # catch warning from FFTKDE and print station ID
    except ValueError as e:
        print(f'Warning: FFTKDE failed for station {stn} with error: {e}')
        continue

    # convert the FFTKDE PDF to a PMF
    fft_pmf = fft_kde_pdf * log_dx
    fft_pmf /= np.sum(fft_pmf)
    assert np.isclose(np.sum(fft_pmf), 1, atol=1e-4), f'PMF for {stn} does not sum to 1: {np.sum(fft_pmf):.4f}'
    # compute measures between the adaptive KDE and the FFTKDE
    measures = eval_obj._evaluate_fdc_metrics_from_pmf(akde_pmf, fft_pmf)
    all_results[stn] = measures
    
/home/danbot/code/data_analysis/lib/python3.12/site-packages/KDEpy/bw_selection.py:285: UserWarning: Silverman's rule failed. Too many idential values. Setting bw = 0.2653109141748294
  warnings.warn(
/home/danbot/code/data_analysis/lib/python3.12/site-packages/KDEpy/bw_selection.py:285: UserWarning: Silverman's rule failed. Too many idential values. Setting bw = 0.3589583777837614
  warnings.warn(
# format to a DataFrame with the station ID as index
results_df = pd.DataFrame(all_results).T
results_df.index.name = 'station_id'
results_df.reset_index(inplace=True)
results_df.to_csv('data/results/kde_fft_comparison.csv', index=False)
results_df.sort_values(by=['kld'], inplace=True, ascending=False)
bad_rmse_stns = results_df['station_id'].values[:10]
results_df.head(10)
    station_id       rmse  relative_error        ve       nse       kge       kld      emd
9      05AB022   0.089766        0.321285  0.017087  0.998129  0.966668  0.478741   0.0991
921   12345000  44.001743        0.286702 -0.289957  0.659076  0.412635  0.398419  19.8344
935   12353820  18.310978        0.297060 -0.333375  0.644401  0.378055  0.394362   6.8054
503    08NL037   0.518950        0.210084 -0.052027  0.991399  0.902767  0.223512   0.1612
420    08NA056   0.056464        0.124036  0.008129  0.998570  0.975052  0.170880   0.1801
417    08NA052   1.238116        0.031092 -0.015836  0.995317  0.955860  0.098915   0.5796
738   12111500  11.219424        0.107840 -0.135755  0.879694  0.679531  0.097771   4.9752
418    08NA053   3.206015        0.046618 -0.027818  0.992550  0.935026  0.097306   1.3586
471    08NH100   7.452565        0.041910 -0.038559  0.983167  0.902817  0.072722   2.9773
953   12367500   0.347037        0.041836 -0.025561  0.987886  0.938282  0.067220   0.4951

4.6.2. Plot an example FDC#

stn = '08MH147'
stave_dist = ReferenceDistribution(**shared_config)
stave_dist._initialize_station(stn)
kde_obj = shared_config['kde_obj']

# take an example slice
df = stave_dist.df[stave_dist.df.index > '2020-01-01'].copy()
qlabel = 'replaced_zero_flow_uar'
# density 
data = df[qlabel].dropna().values
fft_kde_pdf = FFTKDE(bw="silverman").fit(np.log(data)).evaluate(baseline_log_grid)

# convert the FFTKDE PDF to a PMF
fft_pmf = fft_kde_pdf * log_dx
fft_pmf /= np.sum(fft_pmf)
from bokeh.models import Label

label1 = r"$$x_1 \leq x_2 \leq \cdots \leq x_N,$$"
label2 = r"$$x = F(q_t) = \frac{1}{N}, \dots, \frac{n}{N}$$"

# sorted_vals must be computed before the labels are positioned
sorted_vals = np.sort(df[qlabel].dropna().values)[::-1]

fdc_label1 = Label(
    x=sorted_vals[int(len(sorted_vals) * 0.8)],  # place at ~80th percentile of flow
    y=2400,
    text=label1,
    text_font_size="10pt",
    text_color="black",
)
fdc_label2 = Label(
    x=sorted_vals[int(len(sorted_vals) * 0.8)],
    y=1900,
    text=label2,
    text_font_size="10pt",
    text_color="black",
)
from bokeh.layouts import row

# Plot A: Time Series

p1 = figure(title="", x_axis_type="datetime", width=350, height=300, toolbar_location=None)
p1.line(df.index, df[qlabel], line_width=2)
p1.yaxis.axis_label = r"$$\text{UAR } [L/s/\text{km}^2]$$"
p1.xaxis.major_label_orientation = np.pi / 4  # rotate x-axis labels

# Plot B: PDF
p2 = figure(title='', width=330, height=300, x_axis_type='log', toolbar_location=None)
p2.line(baseline_lin_grid[2500:], fft_kde_pdf[2500:], line_width=2, legend_label='PDF')

p2.yaxis.axis_label = r"$$\text{Density}$$"
p2.xaxis.axis_label = r"$$\text{UAR } [L/s/\text{km}^2]$$"

# Plot C: Duration Curve (Exceedance)
sorted_vals = np.sort(df[qlabel].dropna().values)[::-1]
exceed_prob = np.linspace(0, 1, len(sorted_vals), endpoint=False)

p3 = figure(title="", width=340, height=300, toolbar_location=None)
p3.line(exceed_prob * 100, sorted_vals, line_width=2)
p3.xaxis.axis_label = r"$$\text{Exceedance Probability} (\%)$$"
p3.yaxis.axis_label = r"$$\text{UAR } [L/s/\text{km}^2]$$"
# p3.yaxis.visible=False
# p3.add_layout(fdc_label1)
# p3.add_layout(fdc_label2)
p1 = dpf.format_fig_fonts(p1, font_size=14)
p3 = dpf.format_fig_fonts(p3, font_size=14)
lt = gridplot([p1, p3, p2], ncols=3, width=350, height=300)
# lt = row([p1, p3, p2])#, ncols=3, width=350, height=300)
show(lt)

4.6.3. Compute the entropy of each PMF#

Note

The remainder of this notebook is not part of the analysis presented in the accompanying paper.

The entropy of the distribution represents the randomness or disorder of the system, and it is given by the probability-weighted sum of log probabilities over the system states:

\[H(X) = -\sum_{i=1}^N p_i \log_2 p_i\]
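For a PMF on a discrete grid this can be computed directly; a minimal helper (the name is illustrative, and zero-probability states are skipped since they contribute nothing):

```python
import numpy as np

def pmf_entropy_bits(pmf):
    """Shannon entropy in bits of a discrete PMF, skipping zero-probability states."""
    p = np.asarray(pmf, dtype=float)
    p = p[p > 0]  # 0 * log(0) is taken as 0
    return float(-np.sum(p * np.log2(p)))
```

A uniform PMF over \(2^b\) states attains the maximum entropy of \(b\) bits, which is why the ratio of entropy to bit depth is reported in the plots below.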
# compute the entropy of the prior adjusted distribution for each station

bits = list(range(2, 13)) # set a range that is both too low and too high for the data
entropy_output_folder = Path(os.getcwd()) / 'data' / 'results' / 'entropy_results'
if not entropy_output_folder.exists():
    entropy_output_folder.mkdir(parents=True, exist_ok=True)

states = [2**b for b in bits]
for s in states:
    q_values = pmf_df.index.values
    # resample the PMF from the fine q grid down to s log-spaced states
    # create a new DataFrame with the resampled PMF
    pmf_resampled = pd.DataFrame(
        index=np.exp(np.linspace(np.log(q_values.min()),
                                 np.log(q_values.max()),
                                 s)),
        columns=pmf_df.columns,
        dtype=float
    )
    for stn in pmf_df.columns:
        pmf = pmf_df[stn].values
        bin_numbers = np.digitize(q_values, pmf_resampled.index.values)
        tmp = pd.DataFrame({'pmf': pmf, 'bin': bin_numbers})
        agg = tmp.groupby('bin')['pmf'].sum()
        full_pmf = agg.reindex(range(1, s+1), fill_value=0)
        pmf_resampled[stn] = full_pmf.values
        assert np.isclose(full_pmf.sum(), 1), f'PMF for {stn} does not sum to 1: {np.sum(full_pmf):.6f}'

    n_bits = int(np.log2(s))
    pmf_resampled.index.name = 'q'
    pmf_resampled.to_csv(entropy_output_folder / f'bcub_pmf_resampled_{n_bits}bits.csv')
from bokeh.palettes import RdYlGn

def logspace_edges_from_linear_centers(centers):
    """Given linear-space bin centers from a log-uniform KDE, return log-space edges in linear space."""
    log_centers = np.log(centers)
    dlog = log_centers[1] - log_centers[0]
    log_edges = np.concatenate([
        [log_centers[0] - dlog / 2],
        log_centers + dlog / 2
    ])
    return np.exp(log_edges)  # return edges in linear space


pdf_fig = figure(title="BCUB PDF Resampled", width=750, height=400, x_axis_type='log')
entropy_distributions = {}

for s in states:
    bits = int(np.log2(s))
    pmf_path = entropy_output_folder / f'bcub_pmf_resampled_{bits}bits.csv'
    pmf_resampled = pd.read_csv(pmf_path, index_col=0)

    pmf = np.percentile(pmf_resampled, 50, axis=1)  # median PMF across stations

    # Compute bin edges and widths from linear bin centers
    centers = pmf_resampled.index.astype(float).values
    edges = logspace_edges_from_linear_centers(centers)
    dx = np.diff(edges)

    # Convert PMF to PDF and normalize
    pdf = pmf / dx
    x_centers = (edges[:-1] + edges[1:]) / 2
    pdf /= np.trapezoid(pdf, x=x_centers)  # ensure integral = 1

    # Entropy of PMF (still valid for info content calc)
    mask = pmf > 0
    entropy = -np.sum(pmf[mask] * np.log2(pmf[mask]))
    ratio = entropy / bits

    # Plot PDF
    pdf_fig.quad(
        top=pdf, bottom=0,
        left=edges[:-1], right=edges[1:],
        fill_color=RdYlGn[11][states.index(s) % len(RdYlGn[11])],
        line_color=None, alpha=0.7,
        legend_label=f'{bits:.0f}b (H={entropy:.1f}, {100*ratio:.0f}%)'
    )

# Final formatting
pdf_fig.legend.click_policy = 'hide'
pdf_fig.legend.location = 'top_right'
pdf_fig.xaxis.axis_label = "Unit Area Runoff (L/s/km²)"
pdf_fig.yaxis.axis_label = "Probability Density"
pdf_fig = dpf.format_fig_fonts(pdf_fig, font_size=14)

show(pdf_fig)
def compute_empirical_cdf(values):
    """Compute the empirical cumulative distribution function (CDF) of the given values."""
    sorted_values = np.sort(values)
    cdf = np.arange(1, len(sorted_values) + 1) / len(sorted_values)
    return sorted_values, cdf
from bokeh.palettes import Viridis10
# plot the CDFs of entropy by number of bits
entropy_dist_fig = figure(title="Entropy Distributions by Number of Bits", width=750, height=400)
colours = Viridis10
for s in states:
    bits = int(np.log2(s))
    pmf_resampled = pd.read_csv(entropy_output_folder / f'bcub_pmf_resampled_{bits}bits.csv', index_col=0)
    # drop the 'q' column
    # pmf = pmf_resampled.drop(columns=['q']).values
    # broadcast the computation of entropy across all columns
    # to compute the entropy for each column
    pmf = pmf_resampled.values
    entropy_by_column = np.where(pmf > 0, pmf * np.log2(pmf), 0.0)
    sample_entropy = -np.sum(entropy_by_column, axis=0) 
    mean_entropy = np.mean(sample_entropy)
    x, cdf = compute_empirical_cdf(sample_entropy)

    entropy_dist_fig.line(x, cdf, line_width=3, color=colours[states.index(s) % len(colours)],
                          legend_label=f'{bits:.0f}b (H={mean_entropy:.1f})')
entropy_dist_fig.legend.location = 'bottom_right'
entropy_dist_fig.legend.background_fill_alpha = 0.5
entropy_dist_fig.xaxis.axis_label = "Entropy (bits)"
entropy_dist_fig.yaxis.axis_label = "Cumulative Probability"
entropy_dist_fig.legend.click_policy = 'hide'
entropy_dist_fig = dpf.format_fig_fonts(entropy_dist_fig, font_size=14)
show(entropy_dist_fig)
from bokeh.palettes import Viridis10 as colours
pmf_fig = figure(title="PDF by Percentile", width=1000, height=400, x_axis_type='log')
percentiles = (2, 10, 25, 50, 75, 90, 95, 96, 98, 99)
pmf_resampled = pd.read_csv(entropy_output_folder / 'bcub_pmf_resampled_8bits.csv', index_col=0)
for pct in percentiles:
    pmf = np.percentile(pmf_resampled, pct, axis=1)

    linear_centers = pmf_resampled.index.astype(float).values
    edges = logspace_edges_from_linear_centers(linear_centers)
    dx = np.diff(edges)  # linear bin widths corresponding to log-space bins
    pdf = pmf / dx  # convert PMF to PDF
    # compute the area under the PDF
    area = np.trapezoid(pdf, x=edges[:-1])
    pdf = pdf / area  # normalize the PDF to have unit area
    area = np.trapezoid(pdf, x=edges[:-1])
    assert np.isclose(area, 1), f'Area under PDF for {pct}th percentile does not equal 1: {area:.6f}'
    # compute the entropy (in bits) of this percentile's PMF
    entropy_terms = np.where(pmf > 0, pmf * np.log2(pmf), 0.0)
    entropy = -np.sum(entropy_terms)
    max_bits = np.log2(len(pmf))  # maximum possible entropy for this many states
    ratio = entropy / max_bits
    
    pmf_fig.line(
        x=edges[:-1],
        y=pdf,
        line_width=4,
        color=colours[percentiles.index(pct) % len(colours)],
        legend_label=f'{pct}th Percentile (H={entropy:.1f} {100*ratio:.0f}%)'
    )
# format the figure once, outside the percentile loop
pmf_fig.legend.click_policy = 'hide'
pmf_fig.legend.location = 'top_right'
pmf_fig.xaxis.axis_label = "Unit Area Runoff (L/s/km²)"
pmf_fig.yaxis.axis_label = "Probability Density"
pmf_fig.legend.background_fill_alpha = 0.5
pmf_fig.add_layout(pmf_fig.legend[0], 'right')
pmf_fig = dpf.format_fig_fonts(pmf_fig, font_size=14)
show(pmf_fig)
from bokeh.models import LogColorMapper, ColorBar
from bokeh.palettes import Viridis256

percentiles = np.linspace(0.01, 99.9, 500)
pmf_resampled = pd.read_csv(entropy_output_folder / 'bcub_pmf_resampled_11bits.csv', index_col=0)
assert np.allclose(pmf_resampled.sum(), 1), "PMF does not sum to 1 across all stations."
linear_centers = pmf_resampled.index.astype(float).values
edges = logspace_edges_from_linear_centers(linear_centers)
dx = np.diff(edges)
x_vals = edges[:-1]

z_matrix = []
for pct in percentiles:
    pmf = np.percentile(pmf_resampled, pct, axis=1)
    # pmf /= np.sum(pmf)
    z_matrix.append(pmf)

z_image = np.flipud(np.array(z_matrix))

# Use log color mapping for dynamic range of PDF values
mapper = LogColorMapper(palette=Viridis256, low=z_image[z_image > 0].min(), high=z_image.max())
# Create figure with x-axis in log space (via manual transformation)
p = figure(
    title="PMF across Large Sample of Watersheds",
    x_range=(x_vals.min(), x_vals.max()),
    y_range=(0.01, 99.9),
    width=1000,
    height=500,
    x_axis_type="log",
)

p.image(
    image=[z_image],
    x=x_vals.min(),
    y=1,
    dw=x_vals.max() - x_vals.min(),
    dh=98,
    color_mapper=mapper
)

# Axis labels
p.xaxis.axis_label = "Unit Area Runoff (L/s/km²)"
p.yaxis.axis_label = "Percentile"

# Color bar
color_bar = ColorBar(title="Probability Mass",  color_mapper=mapper, label_standoff=12)
p.add_layout(color_bar, 'right')

show(p)
mean_pmf = pmf_resampled.mean(axis=1).values  # mean PMF per bin across stations
mean_pdf = mean_pmf / dx  # convert mass to density using the linear bin widths

# Create a new figure for the mean PDF
mean_pdf_fig = figure(title="Mean PDF of Unit Area Runoff", width=1000, height=400, x_axis_type='log')
mean_pdf_fig.line(x=x_vals, y=mean_pdf, line_width=2, color='blue', legend_label='Mean PDF')
mean_pdf_fig.xaxis.axis_label = "Unit Area Runoff (L/s/km²)"
mean_pdf_fig.yaxis.axis_label = "Probability Density"
mean_pdf_fig.legend.location = 'top_right'
mean_pdf_fig.legend.click_policy = 'hide'
mean_pdf_fig = dpf.format_fig_fonts(mean_pdf_fig, font_size=14)
show(mean_pdf_fig)

4.7. Next steps#

In the subsequent chapters, we’ll use the KL divergence computed for each station as the target of a gradient boosting model, to test how well it can be predicted from catchment attributes.
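The KL divergence between two PMFs on the shared evaluation grid can be sketched as below. The epsilon floor mirrors the prior-adjustment idea introduced earlier; the helper name and the choice of base-2 logarithm are illustrative (the analysis may use a different log base):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-22):
    """D(P || Q) in bits between two PMFs on the same grid.
    q is floored at eps to avoid division by zero on sparse support."""
    p = np.asarray(p, float)
    q = np.clip(np.asarray(q, float), eps, None)
    mask = p > 0  # states with zero mass in P contribute nothing
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))
```

Note that the divergence is asymmetric: `kl_divergence(p, q)` penalizes Q for placing little mass where P has much, which is the relevant direction when Q is a donor/proxy estimate of the observed distribution P.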

4.8. Citations#